Using readxl

The Massachusetts Bay Transportation Authority (“MBTA” or just “the T” for short) manages America’s oldest subway, as well as Greater Boston’s commuter rail, ferry, and bus systems.

It’s your first day on the job as the T’s data analyst and you’ve been tasked with analyzing average ridership through time. You’re in luck, because this chapter of the course will guide you through cleaning a set of MBTA ridership data!

The dataset is stored as an Excel spreadsheet called mbta.xlsx in your working directory. You’ll use the read_excel() function from Hadley Wickham’s readxl package to import it.

The first time you import a dataset, you might not know how many rows need to be skipped. In this case, the first row is a title, so you’ll need to skip the first row.

# Load readxl
library(readxl)

# Import mbta.xlsx and skip first row: mbta
mbta <- read_excel("../xDatasets/mbta.xlsx", skip = 1)

Notice that the first row isn’t actually data, but specifying skip = 1 fixes that.

Examining the data

Your new boss at the T has tasked you with analyzing the ridership data. Of course, you’re going to clean the dataset first. The first step when cleaning a dataset is to explore it a bit.

The mbta data frame is already loaded in your workspace. Pay particular attention to how the rows and columns are organized and to the locations of missing values.

# View the structure of mbta
str(mbta, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    11 obs. of  60 variables:
##  $ ..1    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ mode   : chr  "All Modes by Qtr" "Boat" "Bus" "Commuter Rail" ...
##  $ 2007-01: chr  "NA" "4" "335.819" "142.2" ...
##  $ 2007-02: chr  "NA" "3.6" "338.675" "138.5" ...
##  $ 2007-03: num  1188 40 340 138 459 ...
##  $ 2007-04: chr  "NA" "4.3" "352.162" "139.5" ...
##  $ 2007-05: chr  "NA" "4.9" "354.367" "139" ...
##  $ 2007-06: num  1246 5.8 350.5 143 477 ...
##  $ 2007-07: chr  "NA" "6.521" "357.519" "142.391" ...
##  $ 2007-08: chr  "NA" "6.572" "355.479" "142.364" ...
##  $ 2007-09: num  1256.57 5.47 372.6 143.05 499.57 ...
##  $ 2007-10: chr  "NA" "5.145" "368.847" "146.542" ...
##  $ 2007-11: chr  "NA" "3.763" "330.826" "145.089" ...
##  $ 2007-12: num  1216.89 2.98 312.92 141.59 448.27 ...
##  $ 2008-01: chr  "NA" "3.175" "340.324" "142.145" ...
##  $ 2008-02: chr  "NA" "3.111" "352.905" "142.607" ...
##  $ 2008-03: num  1253.52 3.51 361.15 137.45 494.05 ...
##  $ 2008-04: chr  "NA" "4.164" "368.189" "140.389" ...
##  $ 2008-05: chr  "NA" "4.015" "363.903" "142.585" ...
##  $ 2008-06: num  1314.82 5.19 362.96 142.06 518.35 ...
##  $ 2008-07: chr  "NA" "6.016" "370.921" "145.731" ...
##  $ 2008-08: chr  "NA" "5.8" "361.057" "144.565" ...
##  $ 2008-09: num  1307.04 4.59 389.54 141.91 517.32 ...
##  $ 2008-10: chr  "NA" "4.285" "357.974" "151.957" ...
##  $ 2008-11: chr  "NA" "3.488" "345.423" "152.952" ...
##  $ 2008-12: num  1232.65 3.01 325.77 140.81 446.74 ...
##  $ 2009-01: chr  "NA" "3.014" "338.532" "141.448" ...
##  $ 2009-02: chr  "NA" "3.196" "360.412" "143.529" ...
##  $ 2009-03: num  1209.79 3.33 353.69 142.89 467.22 ...
##  $ 2009-04: chr  "NA" "4.049" "359.38" "142.34" ...
##  $ 2009-05: chr  "NA" "4.119" "354.75" "144.225" ...
##  $ 2009-06: num  1233.1 4.9 347.9 142 473.1 ...
##  $ 2009-07: chr  "NA" "6.444" "339.477" "137.691" ...
##  $ 2009-08: chr  "NA" "5.903" "332.661" "139.158" ...
##  $ 2009-09: num  1230.5 4.7 374.3 139.1 500.4 ...
##  $ 2009-10: chr  "NA" "4.212" "385.868" "137.104" ...
##  $ 2009-11: chr  "NA" "3.576" "366.98" "129.343" ...
##  $ 2009-12: num  1207.85 3.11 332.39 126.07 440.93 ...
##  $ 2010-01: chr  "NA" "3.207" "362.226" "130.91" ...
##  $ 2010-02: chr  "NA" "3.195" "361.138" "131.918" ...
##  $ 2010-03: num  1208.86 3.48 373.44 131.25 483.4 ...
##  $ 2010-04: chr  "NA" "4.452" "378.611" "131.722" ...
##  $ 2010-05: chr  "NA" "4.415" "380.171" "128.8" ...
##  $ 2010-06: num  1244.41 5.41 363.27 129.14 490.26 ...
##  $ 2010-07: chr  "NA" "6.513" "353.04" "122.935" ...
##  $ 2010-08: chr  "NA" "6.269" "343.688" "129.732" ...
##  $ 2010-09: num  1225.5 4.7 381.6 132.9 521.1 ...
##  $ 2010-10: chr  "NA" "4.402" "384.987" "131.033" ...
##  $ 2010-11: chr  "NA" "3.731" "367.955" "130.889" ...
##  $ 2010-12: num  1216.26 3.16 326.34 121.42 450.43 ...
##  $ 2011-01: chr  "NA" "3.14" "334.958" "128.396" ...
##  $ 2011-02: chr  "NA" "3.284" "346.234" "125.463" ...
##  $ 2011-03: num  1223.45 3.67 380.4 134.37 516.73 ...
##  $ 2011-04: chr  "NA" "4.251" "380.446" "134.169" ...
##  $ 2011-05: chr  "NA" "4.431" "385.289" "136.14" ...
##  $ 2011-06: num  1302.41 5.47 376.32 135.58 529.53 ...
##  $ 2011-07: chr  "NA" "6.581" "361.585" "132.41" ...
##  $ 2011-08: chr  "NA" "6.733" "353.793" "130.616" ...
##  $ 2011-09: num  1291 5 388 137 550 ...
##  $ 2011-10: chr  "NA" "4.484" "398.456" "128.72" ...
# View the first 6 rows of mbta
mbta %>% 
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
..1 mode 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08 2007-09 2007-10 2007-11 2007-12 2008-01 2008-02 2008-03 2008-04 2008-05 2008-06 2008-07 2008-08 2008-09 2008-10 2008-11 2008-12 2009-01 2009-02 2009-03 2009-04 2009-05 2009-06 2009-07 2009-08 2009-09 2009-10 2009-11 2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10
1 All Modes by Qtr NA NA 1187.653 NA NA 1245.959 NA NA 1256.571 NA NA 1216.890 NA NA 1253.522 NA NA 1314.821 NA NA 1307.041 NA NA 1232.655 NA NA 1209.792 NA NA 1233.085 NA NA 1230.461 NA NA 1207.845 NA NA 1208.857 NA NA 1244.409 NA NA 1225.527 NA NA 1216.262 NA NA 1223.452 NA NA 1302.414 NA NA 1290.549 NA
2 Boat 4 3.6 40.000 4.3 4.9 5.800 6.521 6.572 5.469 5.145 3.763 2.985 3.175 3.111 3.512 4.164 4.015 5.189 6.016 5.8 4.587 4.285 3.488 3.007 3.014 3.196 3.330 4.049 4.119 4.900 6.444 5.903 4.696 4.212 3.576 3.113 3.207 3.195 3.481 4.452 4.415 5.411 6.513 6.269 4.699 4.402 3.731 3.156 3.14 3.284 3.674 4.251 4.431 5.474 6.581 6.733 5.003 4.484
3 Bus 335.819 338.675 339.867 352.162 354.367 350.543 357.519 355.479 372.598 368.847 330.826 312.920 340.324 352.905 361.155 368.189 363.903 362.962 370.921 361.057 389.537 357.974 345.423 325.767 338.532 360.412 353.686 359.38 354.75 347.865 339.477 332.661 374.260 385.868 366.98 332.394 362.226 361.138 373.443 378.611 380.171 363.275 353.04 343.688 381.622 384.987 367.955 326.338 334.958 346.234 380.399 380.446 385.289 376.317 361.585 353.793 388.271 398.456
4 Commuter Rail 142.2 138.5 137.700 139.5 139 143.000 142.391 142.364 143.051 146.542 145.089 141.585 142.145 142.607 137.453 140.389 142.585 142.057 145.731 144.565 141.907 151.957 152.952 140.810 141.448 143.529 142.893 142.34 144.225 142.006 137.691 139.158 139.087 137.104 129.343 126.066 130.91 131.918 131.252 131.722 128.8 129.144 122.935 129.732 132.892 131.033 130.889 121.422 128.396 125.463 134.374 134.169 136.14 135.581 132.41 130.616 136.901 128.72
5 Heavy Rail 435.294 448.271 458.583 472.201 474.579 477.032 471.735 461.605 499.566 457.741 488.348 448.268 472.624 492.1 494.046 513.204 507.952 518.349 512.309 476.99 517.324 523.644 487.115 446.743 461.004 482.407 467.224 493.152 475.634 473.099 470.828 466.676 500.403 513.406 480.278 440.925 464.069 480.121 483.397 502.374 487.4 490.263 488.587 473.731 521.099 532.403 502.887 450.433 468.418 504.068 516.730 528.631 528.122 529.528 532.888 508.145 550.137 554.932
6 Light Rail 227.231 240.262 241.444 255.557 248.262 246.108 243.286 234.907 265.748 241.434 250.497 233.379 241.223 249.306 253.132 271.07 258.351 266.961 270.158 239.344 258.171 250.063 232.068 205.420 215.66 228.737 222.844 238.232 224.962 226.259 230.308 231.783 250.922 230.739 214.711 194.446 204.396 213.136 211.693 227.246 217.805 215.922 218.729 210.53 236.368 236.366 221.881 196.211 198.45 219.886 227.935 242.28 225.776 221.865 231.01 220.164 244.949 237.768
# View a summary of mbta
sum_mbta <- as.data.frame(do.call(cbind, lapply(mbta, summary)))

sum_mbta[,-1] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
mode 2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08 2007-09 2007-10 2007-11 2007-12 2008-01 2008-02 2008-03 2008-04 2008-05 2008-06 2008-07 2008-08 2008-09 2008-10 2008-11 2008-12 2009-01 2009-02 2009-03 2009-04 2009-05 2009-06 2009-07 2009-08 2009-09 2009-10 2009-11 2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10
Min. 11 11 11 0.114 11 11 0.096 11 11 -0.007 11 11 -0.06 11 11 0.058 11 11 0.06 11 11 0.021 11 11 -0.015 11 11 -0.05 11 11 -0.079 11 11 -0.035 11 11 -0.022 11 11 0.012 11 11 0.008 11 11 0.001 11 11 -0.004 11 11 0.05 11 11 0.054 11 11 0.043 11
1st Qu. character character character 9.2785 character character 5.7 character character 5.539 character character 4.385 character character 5.1695 character character 5.7415 character character 5.6905 character character 4.6885 character character 5.0025 character character 5.845 character character 5.693 character character 4.7835 character character 5.2735 character character 6.4355 character character 5.5675 character character 4.4665 character character 6.0305 character character 6.9255 character character 6.6605 character
Median character character character 137.7 character character 143 character character 143.051 character character 141.585 character character 137.453 character character 142.057 character character 141.907 character character 140.81 character character 142.893 character character 142.006 character character 139.087 character character 126.066 character character 131.252 character character 129.144 character character 132.892 character character 121.422 character character 134.374 character character 135.581 character character 136.901 character
Mean 11 11 11 330.292454545455 11 11 339.846545454545 11 11 352.553727272727 11 11 321.587636363636 11 11 345.603818181818 11 11 359.666909090909 11 11 362.099363636364 11 11 319.882545454545 11 11 330.142363636364 11 11 333.194090909091 11 11 346.686727272727 11 11 312.962545454545 11 11 332.725454545455 11 11 335.964 11 11 346.523818181818 11 11 312.917090909091 11 11 345.165545454545 11 11 353.330727272727 11 11 362.554545454545 11
3rd Qu. character character character 399.225 character character 413.7875 character character 436.082 character character 380.594 character character 427.6005 character character 440.6555 character character 453.4305 character character 386.255 character character 410.455 character character 410.482 character character 437.3315 character character 386.6595 character character 428.42 character character 426.769 character character 451.3605 character character 388.3855 character character 448.5645 character character 452.9225 character character 469.204 character
Max. character character character 1204.725 character character 1246.129 character character 1310.764 character character 1216.89 character character 1274.031 character character 1320.728 character character 1338.015 character character 1232.655 character character 1210.912 character character 1233.085 character character 1291.564 character character 1207.845 character character 1225.556 character character 1244.409 character character 1293.117 character character 1216.262 character character 1286.66 character character 1302.414 character character 1348.754 character

Do you notice anything strange about how the rows and columns are organized?

Removing unnecessary rows and columns

It appears that the data are organized with observations stored as columns rather than as rows. You can fix that.

First, though, you can address the missing data. All of the NA values are stored in the All Modes by Qtr row. This row really belongs in a different data frame; it is a quarterly average of weekday MBTA ridership. Since this dataset tracks monthly average ridership, you’ll remove that row.

Similarly, the 7th row (Pct Chg / Yr) and the 11th row (TOTAL) are not really observations as much as they are analysis. Go ahead and remove the 7th and 11th rows as well.

The first column also needs to be removed because it’s just listing the row numbers.

In case you were wondering, this dataset is stored as a tibble which is just a specific type of data frame.

# Remove rows 1, 7, and 11 of mbta: mbta2
mbta2 <- mbta[-c(1,7,11),]

# Remove the first column of mbta2: mbta3
mbta3 <- mbta2[,-1]

Observations are stored in columns

Recall from a few exercises back that in your T ridership data, variables are stored in rows instead of columns. If you forget what that looked like, go ahead and enter head(mbta3) in the console and/or look at the screenshot.

The different modes of transportation (commuter rail, bus, subway, ferry, …) are variables, providing information about each month’s average ridership. The months themselves are observations. You can tell which is which because as you go through time, the month changes, but the modes of transport offered by the T do not.

As is customary, you want to represent variables in columns rather than rows. The first step is to use the gather() function from the tidyr package, which will gather columns into key-value pairs.

# Load tidyr
library(tidyr)

# Gather columns of mbta3: mbta4
mbta4 <- gather(mbta3, month, thou_riders, -mode)

# View the head of mbta4
mbta4 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
mode month thou_riders
Boat 2007-01 4
Bus 2007-01 335.819
Commuter Rail 2007-01 142.2
Heavy Rail 2007-01 435.294
Light Rail 2007-01 227.231
Private Bus 2007-01 4.772
RIDE 2007-01 4.9
Trackless Trolley 2007-01 12.757
Boat 2007-02 3.6
Bus 2007-02 338.675

Before moving on, have a look at the console. Your dataset is long now – notice that the first column still stores variable names, but now, months have their own column instead of being headers. Also notice the data type of each column.

Type conversions

In a minute, you’ll put variables where they belong (as column names). But first, take this opportunity to change the average weekday ridership column, thou_riders, into numeric values rather than character strings. That way, you’ll be able to do things like compare values and do math.

# Coerce thou_riders to numeric
mbta4$thou_riders <- as.numeric(mbta4$thou_riders)

Variables are stored in both rows and columns

Now, you can finish the job you started earlier: getting variables into columns. Right now, variables are stored as “keys” in the mode column. You’ll use the tidyr function spread() to make them into columns containing average weekday ridership for the given month and mode of transport.

# Spread the contents of mbta4: mbta5
mbta5 <- spread(mbta4, mode, thou_riders)

# View the head of mbta5
mbta5 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
month Boat Bus Commuter Rail Heavy Rail Light Rail Private Bus RIDE Trackless Trolley
2007-01 4.000 335.819 142.200 435.294 227.231 4.772 4.900 12.757
2007-02 3.600 338.675 138.500 448.271 240.262 4.417 5.000 12.913
2007-03 40.000 339.867 137.700 458.583 241.444 4.574 5.500 13.057
2007-04 4.300 352.162 139.500 472.201 255.557 4.542 5.400 13.444
2007-05 4.900 354.367 139.000 474.579 248.262 4.768 5.400 13.479
2007-06 5.800 350.543 143.000 477.032 246.108 4.722 5.600 13.323
2007-07 6.521 357.519 142.391 471.735 243.286 3.936 5.253 13.311
2007-08 6.572 355.479 142.364 461.605 234.907 3.946 5.308 13.142
2007-09 5.469 372.598 143.051 499.566 265.748 4.329 5.609 14.393
2007-10 5.145 368.847 146.542 457.741 241.434 4.315 5.806 14.622

Notice how mbta5 now has columns for modes of transportation with values of average weekday ridership for the given months.

Separating columns

Your dataset is already looking much better! Your boss saw what a great job you’re doing and now wants you to do an analysis of the T’s ridership during certain months across all years.

Your dataset has month names in it, so that analysis will be a piece of cake. There’s only one small problem: if you want to look at ridership on the T during every January (for example), the month and year are together in the same column, which makes it a little tricky.

In this exercise, you’ll separate the month column into distinct month and year columns to make life easier.

# Split month column into month and year: mbta6
mbta6 <- separate(mbta5, month, c("year", "month"))

# View the head of mbta6
mbta6 %>% 
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year month Boat Bus Commuter Rail Heavy Rail Light Rail Private Bus RIDE Trackless Trolley
2007 01 4.000 335.819 142.200 435.294 227.231 4.772 4.900 12.757
2007 02 3.600 338.675 138.500 448.271 240.262 4.417 5.000 12.913
2007 03 40.000 339.867 137.700 458.583 241.444 4.574 5.500 13.057
2007 04 4.300 352.162 139.500 472.201 255.557 4.542 5.400 13.444
2007 05 4.900 354.367 139.000 474.579 248.262 4.768 5.400 13.479
2007 06 5.800 350.543 143.000 477.032 246.108 4.722 5.600 13.323
2007 07 6.521 357.519 142.391 471.735 243.286 3.936 5.253 13.311
2007 08 6.572 355.479 142.364 461.605 234.907 3.946 5.308 13.142
2007 09 5.469 372.598 143.051 499.566 265.748 4.329 5.609 14.393
2007 10 5.145 368.847 146.542 457.741 241.434 4.315 5.806 14.622

Do your values seem reasonable?

Before you write up the analysis for your boss, it’s a good idea to screen the data for any obvious mistakes and/or outliers.

There are many valid techniques for doing this; you’ll practice a couple of them here.

# View a summary of mbta6
sum_mbta6 <- as.data.frame(do.call(cbind, lapply(mbta6, summary)))

sum_mbta6[,-2] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
year Boat Bus Commuter Rail Heavy Rail Light Rail Private Bus RIDE Trackless Trolley
Min. 58 2.985 312.92 121.422 435.294 194.446 2.213 4.9 5.777
1st Qu. character 3.494 345.62575 131.3695 471.05475 220.58925 2.64075 5.9655 11.67925
Median character 4.2925 359.896 138.75 487.2575 231.9255 2.8195 6.615 12.5975
Mean 58 5.06796551724138 358.590448275862 137.351534482759 489.293482758621 232.988810344828 3.35218965517241 6.6036724137931 12.1250517241379
3rd Qu. character 5.3555 372.17875 142.358 511.268 244.53325 4.16725 7.1495 13.32
Max. character 40 398.456 152.952 554.932 271.07 4.878 8.598 15.109
# Generate a histogram of Boat column
hist(mbta6$Boat)

That’s quite an interesting histogram – every value clustered around 4 and one loner out around 40.

Dealing with entry error

Think for a minute about that Boat histogram. Every month, average weekday commuter boat ridership was on either side of four thousand. Then, one month it jumped to 40 thousand without warning?

Unless the Olympics were happening in Boston that month (they weren’t), this value is certainly an error. You can assume that whoever was entering the data that month accidentally typed 40 instead of 4.

Because it’s an error, you don’t want this value influencing your analysis. In this exercise, you’ll locate the incorrect value and change it to 4.

After you make the change, you’ll run the last two commands in the editor as-is. They use functions you may not know yet to produce some cool ridership plots: one showing the lesser-used modes of transport (take a look at the gorgeous seasonal variation in Boat ridership), and one showing all modes of transport. The plots are based on the long version of the data we produced in Exercise 4 – a good example of using different data formats for different purposes.

# Find the row number of the incorrect value: i
i <- which(mbta6$Boat > 20)

# Replace the incorrect value with 4
mbta6$Boat[i] <- 4

# Generate a histogram of Boat column
hist(mbta6$Boat)

# Look at Boat and Trackless Trolley ridership over time (don't change)
j <- which(mbta4$mode == "Boat" | mbta4$mode == "Trackless Trolley")
mbta_boat <- mbta4[j, c("month", "thou_riders", "mode")]
        
ggplot(mbta_boat, aes(x = month, y = thou_riders, col = mode)) +  geom_point() + 
  scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) + 
  scale_y_continuous(name = "Avg Weekday Ridership (thousands)")

# Look at all T ridership over time (don't change)
mbta_all <- mbta4[, c("month", "thou_riders", "mode")]

ggplot(mbta_all, aes(x = month, y = thou_riders, col = mode)) + geom_point() + 
  scale_x_discrete(name = "Month", breaks = c(200701, 200801, 200901, 201001, 201101)) +  
  scale_y_continuous(name = "Avg Weekday Ridership (thousands)")